# Lab 23 - Confidence intervals for regression

We will use the newborn data set from Lab 16, [babies.data](https://www.stat.berkeley.edu/~statlabs/data/babies.data).  This data is from a sample of newborns born between 1960 and 1967 in California in a major hospital system.

The columns are:<br>
bwt: birth weight in ounces (999 = unknown)<br>
gestation: length of pregnancy in days (999 = unknown)<br>
parity: 0 = first born, 9 = unknown<br>
age: mother's age in years (99 = unknown)<br>
height: mother's height in inches (99 = unknown)<br>
weight: mother's prepregnancy weight in pounds (999 = unknown)<br>
smoke: smoking status of mother: 0 = not now, 1 = yes now, 9 = unknown

First, let's import the necessary libraries.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import statsmodels.formula.api as smf
%matplotlib inline

### Loading and cleaning the data

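Read the CSV file into the dataframe `babies`.  As in Lab 16, we need to add the parameter `sep = r"\s+"` since the columns are separated by whitespace instead of commas.  (The `r` prefix makes this a raw string, so Python passes the backslash through to the regular expression unchanged.)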
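To see what the whitespace separator does without touching the real file, here is a minimal sketch on a small in-memory table. The three rows and the name `toy` are invented for illustration and are not taken from `babies.data`.

```python
import io
import pandas as pd

# Hypothetical excerpt in the same whitespace-separated layout (made-up values).
toy_text = """bwt gestation parity age height weight smoke
118 280 0 25 63 120 0
125 275 0 31 65 130 0
110 271 0 22 62 105 1
"""

# r"\s+" is a regular expression meaning "one or more whitespace characters".
toy = pd.read_csv(io.StringIO(toy_text), sep=r"\s+")

print(toy.shape)             # (rows, columns)
print(toy.columns.tolist())  # column names taken from the header line
```

If the separator were left at its default (a comma), each line would be read as a single column, which is an easy way to spot the mistake.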

Display your `babies` dataframe below to check it was created properly.

There are some unknown gestational days (represented with 999) in the dataset, which you can see by plotting the histogram of the `gestation` column.

Let's read in the dataset again, this time specifying "999" as a NaN value.  We did this in Labs 18 and 22 with the Starbucks data.

Now drop any row with a NaN value (also done in Labs 18 and 22):
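As a reminder of how these two steps behave, the following sketch applies them to a tiny made-up table (the values and the names `toy` and `toy_clean` are assumptions for illustration only).

```python
import io
import pandas as pd

# Hypothetical two-column table; 999 plays the role of "unknown".
toy_text = """bwt gestation
118 280
125 999
110 271
"""

# Every cell equal to 999 is read as NaN instead of a number.
toy = pd.read_csv(io.StringIO(toy_text), sep=r"\s+", na_values="999")
print(toy["gestation"].isna().sum())   # how many unknown values were found

# axis = 0 drops rows (not columns) that contain at least one NaN.
toy_clean = toy.dropna(axis=0)
print(len(toy_clean))                  # rows remaining
```

Note that passing a single value to `na_values` applies it to every column, not just `gestation`.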

Finally, double-check that the unknown values have been dropped by plotting a histogram of the `weight` or `gestation` column again.

### Relationship between birth weight and gestational days

In Lab 16 we looked at the relationship between birth weight and the number of gestational days (how long the baby was in the womb).  We will look at this relationship again using regression.

First, we'll visualize the data.  Use the Seaborn library (Lab 21) to plot a scatter plot with the regression line, where gestational days are on the x axis and birth weight (`bwt`) is on the y axis.

<details> <summary>Answer:</summary>
<code>sns.regplot(x = "gestation", y = "bwt", data = babies)</code>
</details>

There appears to be a weak linear relationship between gestational days and birth weight.  Let's compute the correlation matrix to get the correlation between the two variables:

<details> <summary>Answer:</summary>
<code>babies.corr()</code>
</details>
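The correlation matrix lists every pair of columns, so it can help to pull out just the one entry of interest. A small sketch on made-up numbers (the dataframe `toy` is hypothetical) shows two equivalent ways to do that:

```python
import pandas as pd

# Made-up values, only to show where the number sits in the matrix.
toy = pd.DataFrame({
    "gestation": [270, 275, 280, 285, 290],
    "bwt":       [105, 118, 112, 125, 130],
})

matrix = toy.corr()                        # all pairwise correlations
r = toy["gestation"].corr(toy["bwt"])      # just the pair we care about

print(matrix.loc["gestation", "bwt"])
print(r)
```

Both lines print the same value, because the matrix entry in row `gestation`, column `bwt` is exactly the correlation between those two columns.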

What's the correlation between gestational days and birth weight?  Is it high, low, or in the middle?

Next, let's calculate the slope of the regression line by computing the linear model and then displaying the model summary:

<details> <summary>Answer:</summary>
<code>lm = smf.ols(formula = "bwt ~ gestation", data = babies).fit()
lm.summary()</code>
</details>

What's the slope of the regression line?

Remember that this relationship is only based on a sample of data.  A different sample would give a different regression line and slope.  To understand how much the slope could change depending on the sample, we will compute the 95% confidence interval for the slope.  

First, we will create a function for computing the regression and extracting the slope from it:

In [None]:
# df is the name of the dataframe
# x is the name of the column for the independent variable
# y is the name of the column for the dependent variable
def slope(df, x, y):
    formula_string = y + " ~ " + x    # create a string containing the formula in advance
    lm = smf.ols(formula = formula_string, data = df).fit()
    return lm.params.iloc[1]    # position 0 is the intercept, position 1 is the slope

Let's try calling (running) this function on our data:

In [None]:
slope(babies,"gestation","bwt")

Was the slope the same as your previous computation?
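If you want an independent check on a slope value, the least-squares slope can also be computed directly as the covariance of x and y divided by the variance of x. The sketch below verifies this on synthetic data; `np.polyfit` stands in for the statsmodels fit here, and the data and variable names are assumptions made for illustration.

```python
import numpy as np

# Synthetic data with a known underlying slope of 0.5 (not the babies data).
rng = np.random.default_rng(0)
x = rng.normal(280, 16, 200)
y = 0.5 * x - 20 + rng.normal(0, 15, 200)

# Slope from the formula: cov(x, y) / var(x), using the same ddof in both.
slope_formula = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)

# Slope from a degree-1 least-squares fit; index 0 is the slope.
slope_fit = np.polyfit(x, y, 1)[0]

print(slope_formula, slope_fit)
```

The two numbers agree up to floating-point rounding, which is the same agreement you should expect between `slope(...)` and the coefficient shown in the model summary.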

To find the 95% confidence interval, we will:
- create an empty list to store the slopes
- take 1000 bootstrap samples the same size as the original data
- for each sample, compute the slope of the regression line using our function and save it in our list

The pseudo-code is:
<code>
slopes = []
loop 1000 times:
    take a bootstrap sample (take a sample with replacement the same size as the original data)
    compute the slope of the regression line of the sample
    add the slope to `slopes`
</code>

Try to write the actual code below:

<details><summary>Answer:</summary>
<code>slopes = []<br>
for i in range(1000):<br>
    sample = babies.sample(len(babies),replace = True)<br>
    sample_slope = slope(sample,"gestation","bwt")<br>
    slopes.append(sample_slope)</code>
</details>

Plot the histogram of the slopes of the samples:

What is the mean of the slopes?  How does this compare to the slope of the actual data?

Were any of the sample slopes 0? Does this suggest a different sample of the data could have slope 0?

Finally, let's compute the 95% confidence interval by computing the 0.025 quantile (2.5th percentile) and 0.975 quantile (97.5th percentile) of the bootstrap slopes.  We can compute the 0.025 quantile with the code `pd.Series(slopes).quantile(0.025)`.

What's your 95% confidence interval?

It should be close to (0.39,0.56).
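For reference, the whole procedure can be sketched end to end on synthetic data. This is an illustration under stated assumptions, not the lab's answer: the dataframe `toy` is randomly generated, and the helper `toy_slope` uses `np.polyfit` in place of the statsmodels-based `slope` function so the sketch needs only NumPy and Pandas.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data, built with a true slope of 0.5.
rng = np.random.default_rng(0)
n = 300
x = rng.normal(280, 16, n)
y = 0.5 * x - 20 + rng.normal(0, 15, n)
toy = pd.DataFrame({"gestation": x, "bwt": y})

def toy_slope(df, x, y):
    # Hypothetical helper: least-squares slope of column y on column x.
    return np.polyfit(df[x], df[y], 1)[0]

fitted_slope = toy_slope(toy, "gestation", "bwt")

slopes = []
for i in range(1000):
    # Resample rows with replacement, keeping the original sample size.
    sample = toy.sample(len(toy), replace=True, random_state=i)
    slopes.append(toy_slope(sample, "gestation", "bwt"))

# Both ends of the interval in one call.
lower, upper = pd.Series(slopes).quantile([0.025, 0.975])
print(round(fitted_slope, 2), round(lower, 2), round(upper, 2))
```

Using `len(toy)` rather than a hard-coded number keeps the bootstrap sample size correct if rows are dropped later, and passing a list to `quantile` returns both endpoints at once.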

### Maternal Age and Birth Weight

Let's look at whether there is a relationship between maternal age and birth weight.  That is, can maternal age in any way predict the birth weight?

Plot a histogram of the maternal ages:

What do you notice about the histogram?

It looks like someone quite old gave birth.  However, the column descriptions at the top of the lab show that 99 in the maternal age column means the age is unknown, so we also want to remove this row (with age 99) from our dataset.  We can't just add 99 to the list of missing values for every column, because a birth weight could legitimately be 99 oz.  Instead, we use the following code, which tells Pandas which value marks missing data in each column.

In [None]:
babies = pd.read_csv("../Data/babies.data",sep = r"\s+",  na_values = {"gestation":"999","age":"99"})
babies = babies.dropna(axis = 0)
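To confirm that a per-column dictionary leaves a genuine 99 in another column alone, here is a small sketch on a made-up table (the values and the name `toy` are hypothetical; the missing-value codes are written as one-element lists, which Pandas also accepts).

```python
import io
import pandas as pd

# Made-up rows: the first baby really does weigh 99 oz,
# while the 999 and the 99 further down are "unknown" codes.
toy_text = """bwt gestation age
99 280 25
118 999 31
125 275 99
110 271 22
"""

toy = pd.read_csv(
    io.StringIO(toy_text),
    sep=r"\s+",
    na_values={"gestation": ["999"], "age": ["99"]},  # codes apply per column
)

print(toy["bwt"].iloc[0])        # the real 99 oz birth weight is kept
print(toy.isna().sum())          # one NaN in gestation, one in age, none in bwt

toy = toy.dropna(axis=0)
print(len(toy))                  # rows left after dropping unknowns
```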

Try plotting a histogram of the maternal ages again.  Did we remove the outlier?

Next, plot a scatter plot and regression line with maternal age on the x axis and birth weight on the y axis.

Does the slope look close to 0?  If the slope is 0 it would indicate no relationship.  However, for any particular sample of data, the slope is unlikely to be exactly 0.  Therefore, we want to construct the 95% confidence interval for the slope and check if it contains 0.

We will now compute the 95% confidence interval.  

First, compute 1000 bootstrap samples, saving the slope of the regression line of each sample.

Plot a histogram of the slopes:

Does it look like the 95% confidence interval for the slope will contain 0?

Let's formally calculate the 95% confidence interval for the slope by computing the 0.025 and 0.975 quantiles:

What's the 95% confidence interval for the slope?  Does it contain 0?

If so, we must conclude that there is not enough evidence to show a relationship between maternal age and birth weight.
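The decision rule itself is a one-line check. A minimal sketch, where `contains_zero` is a hypothetical helper name and the second interval is an invented example rather than a result from the data:

```python
def contains_zero(lower, upper):
    # True when 0 lies between the two ends of the interval (inclusive).
    return lower <= 0 <= upper

# The gestation interval quoted earlier in the lab excludes 0.
print(contains_zero(0.39, 0.56))    # False

# An invented interval that straddles 0, for contrast.
print(contains_zero(-0.05, 0.20))   # True
```

An interval that contains 0 means the data are consistent with a slope of 0; it does not prove that the slope is exactly 0.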

### Challenges:
- Is there a relationship between maternal height and birth weight?  That is, can birth weight in any way be predicted from maternal height?